Abstract. Founded in 2010, the Taiwan Extragalactic Astronomical Data Center
(TWEA-DC) aims to provide access to large amounts of data for the Taiwanese
and international community, focusing its efforts on extragalactic science. Building
on individual efforts in Taiwan over the past few years, it is the first stepping
stone towards the building of a National Virtual Observatory. Taking advantage of our
own fast indexing algorithm (BLINK), based on an octahedral meshing of the sky coupled
with a very fast kd-tree and parallelization across the available resources,
TWEA-DC will offer from spring 2013 an "on-the-fly" matching service
between on-site and user-supplied catalogs. We will also offer access to public and private
raw and reduced data available to the Taiwanese community. Finally, we are developing
high-end on-line analysis tools, such as an automated photometric redshift and
SED fitting code (APz), and an automated group and cluster finder (APFoF).



1. Introduction: Data Intensive Astronomy 

The recent breakthroughs in telescopes, detectors, and computer technology allow
astronomical instruments to produce large amounts of images and catalogs. It is
today easier to "dial up" a part of the sky than to wait many months for access to
a telescope. With the advent of inexpensive storage technologies and the availability
of high-speed networks, the concept of multi-terabyte on-line databases interoperating
seamlessly is no longer outlandish. More and more catalogs are now interlinked,
crossing wavelength boundaries. Furthermore, the new generation of survey telescopes
(Pan-STARRS, LSST, etc.) will image the entire sky every few days and yield petabytes
of data. Over the past decade the concept of the Virtual Observatory (VO) has emerged
rapidly to address challenges relating to data management, analysis, distribution and
interoperability. The VO is a system in which the vast astronomical archives and
databases around the world, together with analysis tools and computational services,
are linked together into an integrated facility. Data centers play a central role, by
providing not only a good quality service to the community (database and software suites),
but also added value based on expertise (full data analysis or research environments).



2. The Taiwan Extragalactic Data Center: missions and goals

The Taiwanese astronomical community is now stepping into the VO era. These efforts
in Taiwan are made possible by the creation of the first Taiwan-based Data Center
dedicated to extragalactic astronomy, funded by the National Taiwan Normal University in
2010. The Taiwan Extragalactic Astronomical Data Center (TWEA-DC) is designed
to be a fully functioning data center around which the community can work, enabling
Taiwan to join the international VO community. The efforts conducted by the VO
community are already at a very advanced stage. We therefore build on their
latest developments and include the available applications developed by the
international community over the past decades.

Promoting the concept and training the new generation: 

One of the major goals of the TWEA-DC is to prepare the next generation of
astronomers, who will have to keep pace with the changing face of modern
astronomy. Moving into the VO era will have a dramatic impact on the existing skill base of
young astronomers. By moving in this direction now, Taiwan is preparing
the next generation of scientists to face the technological revolution. The project
has involved several graduate students since its start, who are responsible for the
different crucial stages of building the DC: the implementation of hardware,
software, security, and database. Several classes have been organized during the past
year to train the students in the latest techniques in computer science. Two teams of
computer scientists (from the National Central University in Taiwan, and the Advanced
Computer Science Epitech Laboratory in France), including both faculty and students,
are contributing fully to the TWEA-DC. In fact, some graduate students from the astronomy
and computer science groups are paired and develop parts of the project together (such as
APz, see section 3). It is also crucial to raise awareness of the problems related to big
data sets amongst astronomers, who tend to dismiss these issues too easily. We have
organized workshops in Taiwan on this topic, and a growing community
has gathered around the project, developing the scientific and technical cases.

Data Center: Database and Archive hosting: 

Astronomy is now based on large datasets, covering a broad wavelength range. The
challenge is to aggregate the information and generate a final product that will bridge
different fields of expertise and generate an enhanced scientific output. Large amounts of data
storage are required locally to enable fast access to images and catalogs.
Database, hosted catalogs, and peer-to-peer: our relational database grants access to
some of the largest public catalogs available (such as SDSS, 2MASS, UKIDSS, CFHTLS,
SWIRE, 2dF, DEEP2, VVDS). Our DC is VO-friendly, enabling access to any dataset
available in the network. We also plan to offer a peer-to-peer hosting system, where
users can upload their own catalogs and make them available to the community.
Archive for Taiwanese projects: we also aim to archive raw and reduced data for the
Taiwanese community, such as Taiwanese-PI CFHT data and Lulin Observatory datasets,
but also privately owned data obtained as part of Taiwan's involvement in Pan-STARRS,
Subaru-SUMIRE, or the Palomar SED Machine.

Service-based: software development:

One important mission of a Data Center is to provide the community with user-friendly
tools to perform high-level analysis remotely. The large volume of data available prevents
users from transferring them across the network. The new generation of astronomical analysis
will be conducted remotely on High Performance Computing Data Centers, creating a
massive paradigm shift from our current model of astronomical research (the Fourth
Paradigm - Hey et al. 2009). Some basic tools such as plotting and database queries
are provided by the TWEA-DC, as well as access to some of the wide range of tools
already developed by the VO community. In addition, we are developing our own set of
dedicated analysis tools. Beyond the matching procedure, we will provide the community
with a service that takes full advantage of our server, and will eventually offer
a fully dedicated Domain Specific Language (Kamennoff et al. 2012a).



3. The Taiwan Extragalactic Data Center: implementations

Current status: hardware, structure and database:

The current version of TWEA-DC gathers 192 Tb of data storage. The Data Center has
been exclusively funded by the National Taiwan Normal University. The structure of
the TWEA-DC is tailored for rapid data access and is composed of two servers, a large
data storage unit and a backup system. The communication speed between the servers
and the data units is 4 Gb/s, while the backup system operates at a rate of 1 Gb/s.
Our DC also involves a heterogeneous collection of computers, and we are planning
to integrate some GPUs into the system. The current DC Data Warehouse structure is
simple: at one end, the required data are gathered from local or on-line archives using
a relational database (MySQL) and some IVOA tools: DAL (Data Access Layer) and
the VOTable format; at the other end, the user interface and external tools are managed by
a web application. The strength of our DC is its middleware, standing between the
user interface and the data: its frontend is managed as a web service and it is designed
to run on parallel IT systems. The software architecture is optimized for concurrent
multi-user access and to take advantage of various kinds of hardware: multi-computer
systems (grid, cluster, cloud), GPGPUs, etc. We expect to release access to our
database and some of the associated tools to the community in spring 2013, and to our
full archives towards the end of 2013.
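
As an illustration of what a DAL-style request looks like from the user side, the sketch below builds a query URL following the IVOA Simple Cone Search convention (parameters RA, DEC and SR, all in decimal degrees). The text does not say which DAL protocols the TWEA-DC exposes, so the choice of cone search is an assumption, and both the endpoint `http://example.org/scs` and the helper name `cone_search_url` are hypothetical.

```python
from urllib.parse import urlencode


def cone_search_url(base_url, ra_deg, dec_deg, radius_deg):
    """Build a Simple Cone Search style request URL.

    RA, DEC and SR are given in decimal degrees. The base URL is supplied
    by the caller; nothing here refers to a real TWEA-DC endpoint.
    """
    if not 0.0 <= ra_deg < 360.0:
        raise ValueError("RA must be in [0, 360) degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("Dec must be in [-90, 90] degrees")
    if radius_deg < 0.0:
        raise ValueError("search radius must be non-negative")
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    # Append to an existing query string if the base URL already has one.
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"


# Hypothetical endpoint, for illustration only.
EXAMPLE_URL = cone_search_url("http://example.org/scs", 150.1, 2.2, 0.05)
```

A service answering such a request would normally return its result as a VOTable document, which the web application can then parse or forward to other tools.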



The core software: Billion Line INdexing in a clicK (BLINK): 

BLINK is designed to index very rapidly a large amount of data according to their
positions on the sky, enabling an "on-the-fly" matching (Kamennoff et al. 2012b). This
functionality is also crucial to organize data and match objects from various inputs, and
will be used by other software. The indexing engine of BLINK is based on the
Hierarchical Triangular Mesh (HTM - Szalay et al. 2007), which describes a quad-tree
system able to locate and identify objects on a sphere. BLINK is also developed to
be deployed on heterogeneous parallel systems, greatly enhancing the indexing speed
through a P2P distributed system (cf. Tang et al. 2010). A first version of BLINK
should be available in spring 2013. To enhance performance and flexibility, BLINK
will in the future combine other indexing systems (HEALPix - Gorski et al. 2005), or
even indexing not based on position but taking advantage of the wide range of
parameters available (fluxes, shapes, compactness, etc.).
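
To make the quad-tree idea concrete, here is a minimal sketch of an HTM-style lookup: the sphere is first split into the eight faces of an octahedron, and each spherical triangle is then recursively divided into four children by joining the midpoints of its edges. This is not BLINK's code, which is not described here. The function names (`htm_id`, `radec_to_xyz`) are hypothetical, and the root ordering and numbering follow our reading of the HTM scheme and may differ in detail from the reference library.

```python
import math


def radec_to_xyz(ra_deg, dec_deg):
    """Convert sky coordinates in degrees to a unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _midpoint(a, b):
    """Midpoint of the arc a-b, projected back onto the unit sphere."""
    s = (a[0] + b[0], a[1] + b[1], a[2] + b[2])
    norm = math.sqrt(_dot(s, s))
    return (s[0] / norm, s[1] / norm, s[2] / norm)


def _inside_score(tri, p):
    """Positive when p lies inside the counter-clockwise spherical triangle."""
    v0, v1, v2 = tri
    return min(_dot(_cross(v0, v1), p), _dot(_cross(v1, v2), p), _dot(_cross(v2, v0), p))


# Octahedron vertices and the eight root triangles (S0..S3 = 8..11, N0..N3 = 12..15).
_V = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
      (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0)]
_ROOTS = [
    (8, (_V[1], _V[5], _V[2])), (9, (_V[2], _V[5], _V[3])),
    (10, (_V[3], _V[5], _V[4])), (11, (_V[4], _V[5], _V[1])),
    (12, (_V[1], _V[0], _V[4])), (13, (_V[4], _V[0], _V[3])),
    (14, (_V[3], _V[0], _V[2])), (15, (_V[2], _V[0], _V[1])),
]


def htm_id(ra_deg, dec_deg, depth):
    """Return the integer ID of the triangle containing the position at a given depth."""
    p = radec_to_xyz(ra_deg, dec_deg)
    # Taking the best score (rather than testing score >= 0) keeps the lookup
    # robust for points that fall exactly on a triangle edge.
    node_id, tri = max(_ROOTS, key=lambda root: _inside_score(root[1], p))
    for _ in range(depth):
        v0, v1, v2 = tri
        w0, w1, w2 = _midpoint(v1, v2), _midpoint(v0, v2), _midpoint(v0, v1)
        children = [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]
        k = max(range(4), key=lambda i: _inside_score(children[i], p))
        node_id = node_id * 4 + k  # two extra bits per level
        tri = children[k]
    return node_id
```

Because each level appends two bits to the parent ID, objects that are close on the sky share a long common ID prefix. A positional match can therefore be restricted to objects whose IDs fall in the same or neighbouring triangles, instead of comparing every pair of sources.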

Automated Photometric Redshifts (APz):

Redshifts are essential in any study related to galaxy evolution as well as cosmology.
However, measuring redshifts from spectroscopy is very costly in telescope time,
and will not even be possible for the next generation of large sky surveys. Fortunately,
such information can be obtained from multi-band photometry,
through comparison with a training set of spectroscopic redshifts or through template
fitting. APz uses a combination of supervised machine learning algorithms, k-Nearest
Neighbor (kNN) and Support Vector Machine (SVM), to relate the colors
of galaxies of unknown redshift to those of galaxies with known redshifts (see
e.g. Ball & Brunner 2010). The strength of APz is that it will build a super training
set from all the information available in the database, and then train the algorithm on this
dataset offline. Once the algorithm is trained, users will be able to get photometric
redshifts "on-the-fly" for their own catalogs. APz will also provide the redshift probability
distribution function (pdz) for each individual object, and will be able to provide
physical information such as stellar mass, age, dust extinction, etc., by comparison
with synthetic population models. APz should be available by the end of 2013.
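
The kNN part of this approach can be illustrated with a few lines of code. The sketch below is not the APz implementation: it leaves out the SVM step, uses a plain Euclidean distance in color space, and builds a very crude pdz from the redshifts of the neighbors. The function names and the toy numbers are made up for illustration.

```python
import heapq
import math


def knn_photoz(training, colors, k=3):
    """Estimate a photometric redshift from the k nearest training galaxies.

    training: list of (color_vector, spectroscopic_redshift) pairs.
    colors:   color vector of the galaxy of unknown redshift.
    Returns (mean redshift of the neighbors, sorted neighbor redshifts).
    """
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the training set size")
    nearest = heapq.nsmallest(k, training, key=lambda item: math.dist(item[0], colors))
    neighbor_z = sorted(z for _, z in nearest)
    return sum(neighbor_z) / k, neighbor_z


def empirical_pdz(neighbor_z, z_edges):
    """Crude pdz: fraction of neighbors falling in each redshift bin."""
    counts = [0] * (len(z_edges) - 1)
    for z in neighbor_z:
        for i in range(len(counts)):
            if z_edges[i] <= z < z_edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy training set (made-up colors and redshifts, for illustration only).
TOY_TRAINING = [
    ((0.10, 0.20), 0.10),
    ((0.15, 0.25), 0.12),
    ((0.50, 0.60), 0.45),
    ((0.90, 1.00), 0.80),
    ((1.00, 1.10), 0.90),
]
z_phot, neighbors = knn_photoz(TOY_TRAINING, (0.12, 0.22), k=2)
```

In a production setting the brute-force neighbor search would be replaced by a spatial index over color space, and the training set would be prepared offline as described above, so that only the lookup runs when a user submits a catalog.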

Automated Probabilistic Friends-of-Friends algorithm (APFoF):

Properties of galaxies correlate strongly with their close environment, which can be
probed by directly identifying overdensities such as galaxy groups and clusters. However,
the poor precision of redshifts obtained via photometry prevents such
studies from being conducted. We have developed an algorithm which takes advantage of the pdz to identify
overdensities in photometric redshift galaxy surveys: the probabilistic Friends-of-Friends
algorithm (PFoF - Liu et al. 2008). The parameters of PFoF require training
on catalogs of known groups and clusters, which could be fed into our DC, enabling
a fully optimized implementation of the algorithm. We are planning in the future
to implement theoretical catalogs (from N-body simulations) within the DC, which, coupled
with PFoF, will provide additional information (such as dark matter halo masses).
APFoF should be available on our DC in early 2014.
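
The core idea of linking galaxies through their pdz can be sketched as follows. Two galaxies are considered friends if they are close on the sky and if the probability that their true redshifts differ by less than a linking length exceeds a threshold. This is a simplified illustration rather than the PFoF code: it assumes Gaussian pdz, a fixed angular linking length, and a brute-force pair loop, and all function and parameter names are hypothetical.

```python
import math


def link_probability(z1, sigma1, z2, sigma2, l_z):
    """P(|z1 - z2| <= l_z) for two independent Gaussian pdz."""
    s = math.hypot(sigma1, sigma2)  # standard deviation of z1 - z2
    delta = z1 - z2
    if s == 0.0:
        return 1.0 if abs(delta) <= l_z else 0.0

    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return cdf((l_z - delta) / s) - cdf((-l_z - delta) / s)


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Angular separation on the sky (haversine formula), in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def pfof_groups(galaxies, l_sky_deg, l_z, p_th):
    """Group galaxies given as (ra_deg, dec_deg, z_phot, sigma_z) tuples.

    Returns a list of groups, each a list of indices into `galaxies`.
    """
    parent = list(range(len(galaxies)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(galaxies)):
        ra_i, dec_i, z_i, s_i = galaxies[i]
        for j in range(i + 1, len(galaxies)):
            ra_j, dec_j, z_j, s_j = galaxies[j]
            if angular_separation_deg(ra_i, dec_i, ra_j, dec_j) > l_sky_deg:
                continue
            if link_probability(z_i, s_i, z_j, s_j, l_z) > p_th:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(galaxies)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

The two linking lengths and the probability threshold are exactly the kind of parameters that need to be tuned on catalogs of known groups and clusters. For large catalogs the pair loop would be replaced by a neighbor search on a spatial index.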



Acknowledgments. SF acknowledges travel support from the National Science
Council of Taiwan under the grant NSC101-2914-I-003-023-A1. SF and NK are grateful
to the LOC of ADASS XXII for supporting the conference fees.



References 



Ball, N. M., & Brunner, R. J. 2010, International Journal of Modern Physics D, 19, 1049 

Gorski, K. M., Hivon, E., Banday, A. J., Wandelt, B. D., Hansen, F. K., Reinecke, M., & 
Bartelmann, M. 2005, ApJ, 622, 759 

Hey, T., Tansley, S., & Tolle, K. 2009, The Fourth Paradigm: Data-Intensive Scientific
Discovery (Microsoft Research)

Kamennoff, N., Foucaud, S., & Reybier, S. 2012a, in ADASS XXII, edited by D. Friedel,
M. Freemon, & R. Plante, vol. TBD of ASP Conf. Series, TBD

Kamennoff, N., Foucaud, S., Reybier, S., Tsai, M.-F., & Tang, C.-H. 2012b, in ADASS XXI,
edited by P. Ballester, D. Egret, & N. P. F. Lorente, vol. 461 of ASP Conf. Series, 541

Liu, H. B., Hsieh, B. C., Ho, P. T. P., Lin, L., & Yan, R. 2008, ApJ, 681, 1046

Szalay, A. S., Gray, J., Fekete, G., Kunszt, P. Z., Kukol, P., & Thakar, A. 2007, eprint
arXiv:cs/0701164

Tang, C.-H., Huang, A.-C., Tsai, M.-F., & Wang, W.-J. 2010, in 2010 International Computer
Symposium (ICS), edited by H.-C. Hsiao, IEEE Conf. Series, 869



